Updated: Wed Feb 27 19:01:33 2019
Read the dataset in with proper column names:
## user system elapsed
## 0.489 0.055 0.617
Applying pre-processing on our dataframe:
library(tidyr)
ipaddr <- separate(data=ipaddr,
col="Coords",
into=c("Lat", "Lng"),
sep = ",",
remove=TRUE)
ipaddr[,c("Lat", "Lng")] <- lapply(ipaddr[,c("Lat", "Lng")], as.numeric)
## Classes 'tbl_df', 'tbl' and 'data.frame': 258626 obs. of 9 variables:
## $ IP : chr "222.76.212.189" "222.76.212.185" "222.76.212.186" "5.34.246.67" ...
## $ Reliability: int 4 4 4 6 4 4 4 4 4 6 ...
## $ Risk : int 2 2 2 3 5 2 2 2 2 3 ...
## $ Type : chr "Scanning Host" "Scanning Host" "Scanning Host" "Spamming" ...
## $ Country : Factor w/ 152 levels "CN","US","UA",..: 1 1 1 2 3 2 1 1 1 NA ...
## $ Locale : Factor w/ 2572 levels "Xiamen","Merefa",..: 1 1 1 NA 2 3 1 1 1 NA ...
## $ Lat : num 24.5 24.5 24.5 38 49.8 ...
## $ Lng : num 118.1 118.1 118.1 -97 36.1 ...
## $ x : chr "11" "11" "11" "12" ...
Simple visualization of the dataframe:
library(ggplot2)
library(hrbrthemes)
# get top 20 countries
top_countries <- names(table(ipaddr$Country)[1:20])
ggplot(data=subset(ipaddr, Country %in% top_countries),
aes(x=reorder(Country, Country, length))) +
geom_bar() +
coord_flip() +
labs(title="Country by appearance frequency",
subtitle="Number of countries in the IP reputation database",
x="Country", y="Count") +
theme_ft_rc()
ggplot(data=subset(ipaddr, Country %in% top_countries),
aes(x=reorder(Country, Lng), y=Lng)) +
geom_boxplot() +
coord_flip()+
labs(title="Median of Lng by Country")+
theme_ipsum_rc()
An IPv4 address consists of 32 bits. This becomes clear when you convert the decimal notation into the binary equivalent: 201.105.7.34 corresponds to 11001001 01101001 00000111 00100010. In our dataset, our each of the IP address is also given a score on its Risk and Reliability:
## Reliability 1 2 3 4 5 6 7 8 9 10
## Risk
## 1 0 0 16 7 0 8 8 0 0 0
## 2 804 149114 3670 57653 4 2084 85 11 345 82
## 3 2225 3 6668 22168 2 2151 156 7 260 79
## 4 2129 0 481 6447 0 404 43 2 58 24
## 5 432 0 55 700 1 103 5 1 20 11
## 6 19 0 2 60 0 8 0 0 1 0
## 7 3 0 0 5 0 0 0 0 2 0
## Risk Reliability Frequency
## 9 1 3 16
## 15 1 4 7
## 25 1 6 8
## 31 1 7 8
## 1 2 1 804
## 7 2 2 149114
## # A tibble: 6 x 10
## IP Reliability Risk Type Country Locale Lat Lng x
## <chr> <int> <int> <chr> <fct> <fct> <dbl> <dbl> <chr>
## 1 222.… 4 2 Scan… CN Xiamen 24.5 118. 11
## 2 222.… 4 2 Scan… CN Xiamen 24.5 118. 11
## 3 222.… 4 2 Scan… CN Xiamen 24.5 118. 11
## 4 5.34… 6 3 Spam… US <NA> 38 -97 12
## 5 178.… 4 5 Scan… UA Merefa 49.8 36.1 11
## 6 66.2… 4 2 Scan… US Union… 37.6 -122. 11
## # … with 1 more variable: simple_type <fct>
## Risk Reliability Type Freq
## 1 1 1 C&C 0
## 2 2 1 C&C 0
## 3 3 1 C&C 0
## 4 4 1 C&C 1
## 5 5 1 C&C 2
## 6 6 1 C&C 1
~ is 126This original address definition means the first eight bits (octet) of the 32-bit address is the most important as it, being the network number field, specified the particular network a host was attached to. The remaining 24 bits uniquely identified a host connected to that network. This format was sufficient at a time when only a few large networks existed (since only the first octet specifies the network number, only 254 indepedent networks are supported at that time). This necessitates the introduction of address classes, and hence a new addressing architecture was introduced in 1981 as part of the specification of the Internet Protocol. This new architecture divides the address space into three classes (class A, B, C).
The number of addresses usable is always \(2^N-2\) where N is the number of bits. The subtraction of 2 adjusts for the use of the all-bits-zero host portion for network address and the all-bits-one host portion as a broadcast address
The later versions of the IPv4 addressing system broadly separates IP addresses into 5 classes, which can be identified by the first octet of IP address. There are 4 octets so here by “first octet” we’re referring to the 127 in “127.100.100.0” (similarly, 0 is the fourth octet).
Classful addressing definition:
| Class | Theoretical Address Range | Binary Start | Used for |
|---|---|---|---|
| A | 0.0.0.0 to 127.255.255.255 | 0 | Very large networks |
| B | 128.0.0.0 to 191.255.255.255 | 10 | Medium networks |
| C | 192.0.0.0 to 223.255.255.255 | 110 | Small networks |
| D | 224.0.0.0 to 239.255.255.255 | 1110 | Multicast |
| E | 240.0.0.0 to 247.255.255.255 | 1111 | Experimental |
This architecture extended the addressing capability in the Internet but did not prevent IP address exhaustian. Many sites required larger address blocks than a Class C network provided (\(2^{8}-2 = 254\)) and thus receive a Class B block, rapidly depleting the availability of Class B addresses. Classful networking was replaced by Classless Inter-Domain Routing (CIDR) in 1993 to solve this problem.
Class D and E not shown for brevity:
| Classes (A to C shown) | Bit-wise representation | Notation |
|---|---|---|
| Class A start: 0.0.0.0 | 00000000.00000.00000.00000.0000 | 0nnnnnnn HHHHHHHH HHHHHHHH HHHHHHHH |
| Class A end: 127.255.255.255 | 01111111.00000.00000.00000.0000 | 0nnnnnnn HHHHHHHH HHHHHHHH HHHHHHHH |
| Class B start: 128.0.0.0 | 10000000.00000000.00000000.00000000 | 10nnnnnn nnnnnnnn HHHHHHHH HHHHHHHH |
| Class B end: 191.255.255.255 | 10111111.11111111.11111111.11111111 | 10nnnnnn nnnnnnnn HHHHHHHH HHHHHHHH |
| Class C start: 192.0.0.0 | 11000000.00000000.00000000.00000000 | 110nnnnn nnnnnnnn nnnnnnnn HHHHHHHH |
| Class C end: 223.255.255.255 | 11011111.11111111.11111111.11111111 | 110nnnnn nnnnnnnn nnnnnnnn HHHHHHHH |
We’re going to be making use of the unlist and strsplit functions to perform numerical computations on our IP addresses. Here’s a primer if you’re unfamiliar with the two functions:
## [1] "192" "168" "100" "0"
## [1] 192 168 100 0
We will also be using a left reduce function, which applies a function to the elements of a vector:
## [1] 460
## [1] 460
Combining these two, we can now write the IP to Integers function. Don’t worry about the functions from bitops for now:
library(bitops)
# Define function for converting IP addresses to/from integers
# take an IP address string in dotted octets (e.g. 192.168.0.1)
# and convert it to a 32-bit long integer (3232235521)
ip2long <- function(ip){
# convert string into vector of characters
# 255.0.0.17 -> c(255, 0, 0, 17)
ips <- unlist(strsplit(ip, ".", fixed=TRUE))
# bit-shift, then "OR" the octets
octet <- function(x, y){
bitOr(bitShiftL(x,8), y)
}
Reduce(octet, as.integer(ips))
}
# take a 32-bit IP address (3232235521)
# and convert it to a dot notation
long2ip <- function(longip){
# set up reversing bit manipulation
octet <- function(nbits){
bitAnd(bitShiftR(longip, nbits), 0xFF)
}
paste(Map(octet, c(24, 16, 8, 0)), sep="", collapse=".")
}
## [1] "192.168.0.1"
## [1] 3232235521
In the context of classful IP Addresses, one may think of writing an extraction function to determine the class membership of an IP address based on what we’ve learned. However, to the extent that CIDR has rendered classful IP addresses obsolete. Protocols can no longer assume they can deduce the prfix length or subnet mask from the first part of the address; Instead, all prefix lengths are explicitly communicated: classless.
Recall that in the classful structure,
This was very inefficient for networks that didn’t fit these sizes. A network that needed 4096 addresses would either get sixteen Class C delegations (which would be bad for the global routing table because each of them would have to be routed separately: the class size was built into the protocol) or they would get one Class B delegation (which would waste a lot of addresses).
In 1993 CIDR was introduced. The protocols were adjusted to be able to deal with prefixes of different sizes and it became possible to route (both internally and externally) prefixes like a /30 or a /21 or a /15 etc etc. Anything between /0 and /32 became possible. Organisations that needed 16384 addresses could get a /18: exactly what they would need instead of the Class B delegation that consume 65,534 addresses.
CIDR encompasses the concept of variable-length subnet masking (VLSM) technique, which allows for the specification of arbitrary-length prefixes. CIDR introduced a new representation of IP addresses by prefixing an address with a suffix indicating the number of bits of the prefix, such as 192.0.2.0/24 for IPv4 and 2001:db8::/32 for IPv6. CIDR also introduces an administrative process of allocating address blocks to organizations based on their actual and short-term projected needs.
CIDR is based on the idea of subnet masks. A mask is placed over an IP address and creates a sub network: a network that is subordinate to the internet. The subnet mask signals to the router which part of the IP address is assigned to the hosts (the individual participants of the network) and which determines the network.
Instead of adding a subnet mask, a specification in the form of suffixes can also be integrated directly into the IP address using classless inter-domain routing. An integral fact pertaining CIDR is the concept of VLSM: the variable length subnet mask allows subnets to be realized with variable lengths and not only in size order of the network classes.
CIDR notation is a compact representation of an IP address and its associated routing prefix. The notation is constructed from an IP address, a slash (‘/’) character, and a decimal number. The number is the count of leading 1 bits in the subnet mask. Larger values here indicate smaller networks. The maximum size of the network is given by the number of addresses that are possible with the remaining, least-significant bits below the prefix.
192.168.100.14/24 represents the IPv4 address 192.168.100.14 and its associated routing prefix 192.168.100.0, or equivalently, its subnet mask 255.255.255.0, which has 24 leading 1-bits192.168.100.0/22 represents the 1024 IPv4 addresses from 192.168.100.0 to 192.168.103.255An IP address has two components: - Network address
- Host address
A subnest mask separates the IP address into the network and host address (
A subnet mask is referred to as such because it identifies network addresses of an IP addresses by performing a bitwise AND operation on the netmask.
## [1] 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1] 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A subnet mask is a 32-bit number that masks an IP address: it is made by setting network bits to all 1s and setting host bits to all 0s. Examples of commonly used netmasks for classed networks are 8-bits (Class A), 16-bits (Class B) and 24-bits (Class C), and classless networks are as follows:
| Class | Address | # of Hosts | Netmask (Binary) | Netmask (Decimal) |
|---|---|---|---|---|
| CIDR | /4 | 240,435,456 | 11110000 00000000 00000000 00000000 | 240.0.0.0 |
| CIDR | /5 | 134,217,728 | 11111000 00000000 00000000 00000000 | 248.0.0.0 |
| CIDR | /6 | 67,108,864 | 11111100 00000000 00000000 00000000 | 252.0.0.0 |
| … | … | … | … | … |
| A | /8 | 16,777,216 | 11111111 00000000 00000000 00000000 | 255.0.0.0 |
| CIDR | /9 | 8,388,608 | 11111111 10000000 00000000 00000000 | 255.128.0.0 |
| … | … | … | … | … |
| B | /16 | 65,534 | 11111111 11111111 00000000 00000000 | 255.255.0.0 |
| CIDR | /17 | 32,768 | 11111111 11111111 10000000 00000000 | 255.255.128.0 |
| CIDR | /18 | 16,384 | 11111111 11111111 11000000 00000000 | 255.255.192.0 |
| … | … | … | … | … |
| CIDR | /22 | 1,024 | 11111111 11111111 11111100 00000000 | 255.255.252.0 |
| C | /24 | 256 | 11111111 11111111 11111111 00000000 | 255.255.255.0 |
| … | … | … | … | … |
| CIDR | /30 | 4 | 11111111 11111111 11111111 11111100 | 255.255.255.252 |
To see the conversion between binary to integers, we can use the strtoi function in R. For example, in the case of /4 and /5, the respective binary addresses begin with:
## [1] 240
## [1] 248
From the examples above, we observe the CIDR notation written as the address of a network followed by a slash and ending with the bit-length of the prefix. 198.51.100.0/24 thus allocated 24 bits for the network prefix with the remaining 8 bits reserved for host addressing. In other words: addresses in the range of 198.51.100.0 to 198.51.100.255 belong to this subnet.
Subnet masks are also expressed in dot-decimal notation like an address. For example, 255.255.255.0 is the subnet mask for the prefix 198.51.100.0/24. Traffic is exchanged between subnetworks through routers when the routing prefixes of the source address and the destination address differ. A router serves as a logical or physical boundary between the subnets.
In the address allocation architecture of the Internet using CIDR and in large organizations, it is necessary to allocate address space efficiently. Subnetting may also enhance routing efficiency, or have advantages in network management when subnetworks are administratively controlled by different entities in a larger organization. Subnets may be arranged logically in a hierarchical architecture, partitioning an organization’s network address space into a tree-like routing structure.
Another advantage lies in security since all hosts in a subnetwork see all packets transmitted by any node in a network. In addition, performance of a network is adversely affected under heavy traffic load due to collisions and retransmissions.
We’ve learned that applying a subnet mask to an IP address separates network address from host address. The network bits are represented by the 1’s in the mask, and the host bits are represented by 0’s. When we apply the Class C subnet mask to the IP address 216.3.128.12, we get through the bit-wise AND operator:
IP: 1101 1000.0000 0011.1000 0000.0000 1100 (216.003.128.012) Mask: 1111 1111.1111 1111.1111 1111.0000 0000 (255.255.255.000)
Produces: 1101 1000.0000 0011.1000 0000.0000 0000 (216.003.128.000)
## [[1]]
## [1] 1 1 0 1 1 0 0 0
##
## [[2]]
## [1] 0 0 0 0 0 0 1 1
##
## [[3]]
## [1] 1 0 0 0 0 0 0 0
##
## [[4]]
## [1] 0 0 0 0 1 1 0 0
Another scenario where subnetting is used is one where a web host with a Class C network needing to divide the network such that parts of its network can be leased to its customers. Assuming that the network address given is 216.3.128.0 and the company wants to divide the network into 2: first half for internal usage and second half to its tenants / customers:
- 216.3.128.(0000 0000): assigned to web host
- 216.3.128.(1000 0000): assigned to customers
The web host will have the subnet mask of 216.3.128.128 (/25). Now, we’ll further divide the 2nd half into eight block of 16 IP addresses: - 216.3.128.128 (strtoi("1000000", base=2)) Customer 1 gets 16 IPs (14 usable)
- 216.3.128.144 (strtoi("1001000", base=2)) Customer 2 gets 16 IPs (14 usable)
- 216.3.128.160 (strtoi("1010000", base=2)) Customer 3 gets 16 IPs (14 usable)
- …
Subnet mask of 255.255.255.(1111 0000) / subnet mask of 255.255.255.240
## [1] 128
## [1] 144
## [1] 160
The classful address scheme (Class A, B and C) of allocating IP addresses in 8-bit increments can be very wasteful. With classful IP addressing, a minuimum number of IP addresses allocated to any organization is 256 (Class C). When a company only requires 15 but receive 256 addresses, waste is incurred. An organization that requires more than 256 (for example, 500) is assigned a Class B, which allocates 65,536 and an organization that is assigned Class A is allocated 16.8 million IP addresses.
With CIDR, a network of IP addresses is allocated in 1-bit increments as opposed to 8-bits in classful network. The use of a CIDR notated address can easily represent classful addresses (Class A = /8, Class B = /16, and Class C = /24). The number next to the slash (i.e. /8) represents the number of bits assigned to the network address.
When we’re performing CIDR-based analyses, a common task is to determine whether an address falls within a given CIDR range. In R code, we are converting both the IP address and network block address to integer then performing the necessary bitwise operations to see if they do match up.
ip_in_cidr <- function(ip, cidr){
long_ip <- ip2long(ip)
cidr_parts <- unlist(strsplit(cidr, "/"))
cidr_range <- ip2long(cidr_parts[1])
cidr_mask <- bitShiftL(bitFlip(0),
(32-as.integer(cidr_parts[2]))
)
return(bitAnd(long_ip, cidr_mask) == bitAnd(cidr_range, cidr_mask))
}
ip_in_cidr("10.0.1.15", "10.0.1.3/24")
## [1] TRUE
ip_in_cidr("10.0.1.15", "10.0.2.255/24")
## [1] FALSE
With CIDR, blocks of addresses can be treated as groups of nodes: we can use them to identify network packets (incoming or outgoing traffic) and therefore classify groups of malicious / legitimate nodes. A machine learning engineer can then use these groupings to dig into the data and relationships and extract potential features.
Furthermore, once we are familiar with the CIDR prefix format, we can see how those prefixes are grouped and defined as an autonomous system (AS). An AS is a fancy term but what it really mean is a collection of connected Internet Protocol (IP) routing prefixes under the control of one or more network operators on behalf of a single administrative entity that presents a common and clearly defined routing policy to the internet.
Originally the definition required control by a single entity, typically an ISP or a very large organization with independent connections to multiple networks, that adhered to a single routing policy. That ISP must have an officially registered autonomous system number (ASN). A unique ASN is allocated to each AS and uniquely identifies each network on the internet. ASN are assigned in blocks by IANA to regional internet registries (RIRs) and the RIR in turn assign ASNs to entities within its designated area from the block assigned by IANA.
These Autonomous Systems must have an officially registered autonomous system number (ASN), which they get from their Regional Internet Registry: AFRINIC, ARIN, APNIC, LACNIC or RIPE NCC
A newer definition came into use because multiple organizationscan run Border Gateway Protocol (BGP) using private AS numbers to an ISP that connects all those organizations to the internet. BGP is a protocol designed to exchange routing and reachability information among autonomous systems (AS) on the internet; it makes routing decisions based on paths, network policies or rule-sets configured by a network administrator and is involved in making core routing decisions.
A unique ASN (AS Number) is allocated to each AS for use in BGP routing. AS numbers are important because the ASN uniquely identifies each network on the Internet.
ggplot2 Visualization of malicious nodesTaking a peek at the dataset:
## IPAddr Reliability Risk Type Country Locale LAT
## 4 5.34.246.67 6 3 Spamming US 38.0000
## 10 174.142.46.19 6 3 Spamming 24.4798
## 16 112.216.121.87 4 3 Scanning Host KR 37.0000
## 17 112.216.121.78 4 3 Scanning Host KR 37.0000
## 18 112.216.121.77 4 3 Scanning Host KR 37.0000
## 19 112.216.121.75 4 3 Scanning Host KR 37.0000
## LNG
## 4 -97.0000
## 10 118.0819
## 16 127.5000
## 17 127.5000
## 18 127.5000
## 19 127.5000
##
## C&C Malicious Host Malware IP Scanning Host Spamming
## 463 3180 5782 31854 1291
When two routers have established connection for exchanging BGP information, they are referred to as BGP peers. Such BGP peers exchange routing information between them via BGP sessions that run over TCP, which is a reliable, connection oriented & error free protocol.
Once the BGP Session is established, the routers can advertise a list of network routes that they have access to and will scrutinize them to find the route with the shortest path. BGP doesn’t make sense when a peer is connected only to one other peer because that peer is always going to be the best and only path to other networks. When a network is connected to multiple peers then certain paths will be shorter, faster or more reliable than others. Google’s AS15169 peers with 265 other IPv4 peers one of which is Digital Ocean Inc.’s AS14061. Both are connected to other ISPs on the internet, but since they are peers they can also exchange routing information, giving routers within their network a shorter path of connectivity. If the neighborship is broken, their routers can re-arrange their routing tables to connect with other Autonomous Systems through other peers / AS.
Because of the relationship between ASN and BGP, it’s possible to know the adjacent neighbors of each AS; if one AS “neighborhood” is rife with malicious nodes or suspicious activity it might a leading indicator that autonomous systems around it are also harboring malicious traffic.
For comparison, if we refer to the DNSlytics for as14061: Digital Ocean and compare it against as7393: Cybercon, we can get some quick summary of the IPs that are hosted on each AS. Notice that, for example, the number of spam domains hosted on as14061 is 1101-to-1 for example as compared to 8669-to-1.
## Variables as14061 as7393
## 1 Assignee DigitalOcean Cybercon
## 2 Number of domains hosted 1,667,967 52,014
## 3 Number of adult domains hosted 5,636 754
## 4 Number of name servers hosted 70,825 1,449
## 5 Number of SPAM hosts hosted 1,515 6
## 6 Number of open proxies hosted 195 0
## 7 Number of mail servers hosted 223,061 27,664
## 8 Number of IDN domains hosted 1,620 0
## 9 Number of domains in Alexa top million 17,939 500
In the following chapter, we’ll draw upon lessons from traditional exploratory data analysis and visual exploration techniques to help us in our process of insight discovery.
In the following subsection, we’ll start by augmenting our data with additional meta-data from the IANA IPv4 Address Space Registry. The dataset we’ll read in represent a high level grouping of IPv4 address space registry allocations - although it should be said that most of the registrants are not responsible for the activities of individual nodes. We’ll use this extra information for what it’s intended: to set up further investigations.
## Prefix Designation Date WHOIS
## 1 000/8 IANA - Local Identification 1981-09
## 2 001/8 APNIC 2010-01 whois.apnic.net
## 3 002/8 RIPE NCC 2009-09 whois.ripe.net
## 4 003/8 Administered by ARIN 1994-05 whois.arin.net
## 5 004/8 Level 3 Parent, LLC 1992-12 whois.arin.net
## 6 005/8 RIPE NCC 2010-11 whois.ripe.net
## RDAP Status..1.
## 1 RESERVED
## 2 https://rdap.apnic.net/ ALLOCATED
## 3 https://rdap.db.ripe.net/ ALLOCATED
## 4 https://rdap.arin.net/registryhttp://rdap.arin.net/registry LEGACY
## 5 https://rdap.arin.net/registryhttp://rdap.arin.net/registry LEGACY
## 6 https://rdap.db.ripe.net/ ALLOCATED
## Note
## 1 [2]
## 2
## 3
## 4
## 5
## 6
Cleaning up iana$Prefix since it uses the old/BSD-number formatting (allows leading zeros and we don’t need to know the CIDR component):
iana$Prefix <- sub(pattern = "^(00|0)", replacement="", x = iana$Prefix, perl = TRUE)
iana$Prefix <- sub(pattern = "/8$", replacement="", x = iana$Prefix, perl = TRUE)
tail(iana)
## Prefix Designation Date WHOIS RDAP Status..1. Note
## 251 250 Future use 1981-09 RESERVED [16]
## 252 251 Future use 1981-09 RESERVED [16]
## 253 252 Future use 1981-09 RESERVED [16]
## 254 253 Future use 1981-09 RESERVED [16]
## 255 254 Future use 1981-09 RESERVED [16]
## 256 255 Future use 1981-09 RESERVED [16][17]
Next, we want to extract just the prefix from the original AlienVault list of IP addresses. We do this using sub to perform replacement on all characters from the first “.” and onwards:
data$IPPrefix <- sub("\\..*","", data$IPAddr)
head(data$IPPrefix,10)
## [1] "222" "222" "222" "5" "178" "66" "222" "222" "222" "174"
We’ll now create a Designation variable by assigning the appropriate value of iana$Designation based on the data$IPPrefix values:
data$Designation <- sapply(data$IPPrefix, function(ip){
iana[iana$Prefix == ip, "Designation"]
})
summary(data$Designation, maxsum=8)
## APNIC RIPE NCC ARIN
## 93776 74789 42358
## LACNIC Administered by ARIN Administered by RIPE NCC
## 18914 18008 5893
## Administered by APNIC (Other)
## 2615 2273
Quickly compare the distribution of registry present in our dataset against that of the external IANA data:
combined_Designation <- merge(
data.frame(table(data$Designation)),
data.frame(table(iana$Designation)), by = "Var1"
)
colnames(combined_Designation) <- c("Registry", "AlienVault_Count", "IANA_Count")
head(combined_Designation[order(combined_Designation$IANA_Count, decreasing = TRUE), ],8)
## Registry AlienVault_Count IANA_Count
## 3 Administered by ARIN 18008 55
## 8 APNIC 93776 45
## 10 ARIN 42358 36
## 30 RIPE NCC 74789 35
## 21 Future use 0 16
## 27 Multicast 0 16
## 25 LACNIC 18914 9
## 2 Administered by APNIC 2615 6
We can compute correlation between the number of malicious nodes attributed to a registry (AlienVault_Count) and the number of assigned network blocks from our IANA data. It makes sense that there would be more malicious nodes in larger groups of assigned network blocks.
## [1] 0.6898625
By default, the cor() computes the correlation using the “pearson” method, but we could as well have specified the method to be “spearman” or “kendall”.
The different values for our method parameter corresponds to the different correlation methods:
- pearson: \(\frac{\sum{(x-\bar{x})(y-\bar{y})}}{\sqrt{\sum{(x-\bar{x})^2 \sum{(y-\bar{y})^2}}}\) or alternatively: cov(x,y)/(sd(x)*sd(y)) - spearman: \(R_s = 1-(\frac{6\sum{d}^2}{n^3-n})\) where \(d^2\) is the squared difference between the ranks of two values and n is the number of observations
- kendall: not discussed in details here, but refers to Kendall’s Tau (Kendall’s Rank Correlation)
For this chapter, we’ll work with the ZeuS botnet malware. It is a crimeware kit that steals credentials from online services like social networks, online banking accounts, FTP accounts, email accounts and other (phishing). Our data is gathered from the ZeuS tracker site, and carries the following description:
This blocklist contains the same data as the ZeuS IP blocklist (BadIPs) but with the slight difference that it doesn’t exclude hijacked websites (level 2) and free web hosting providers (level 3). This means that this blocklist contains all IPv4 addresses associated with ZeuS C&Cswhich are currently being tracked by ZeuS Tracker. Hence this blocklist will likely cause some false positives.
## IP
## 1 101.200.81.187
## 2 103.19.89.118
## 3 103.230.84.239
## 4 103.4.52.150
## 5 103.7.59.135
## 6 104.247.219.41
In the following section, we will be using some preprocessing functions from the netintel library and also some custom functions:
To see a few of the functions at work, we’ll try with BulkOrigin which retrieves the BGP Origin ASN for a list of IPv4 addresses:
BulkOrigin("182.253.162.193")
## AS IP BGP.Prefix CC Registry Allocated
## 1 17451 182.253.162.193 182.253.162.0/24 ID apnic 2010-05-10
## AS.Name
## 1 BIZNET-AS-AP BIZNET NETWORKS, ID
Another example is the BulkOriginASN which retrieves BGP Origin ASN info for a list of ASN ids:
BulkOriginASN("AS17451")
## AS CC Registry Allocated AS.Name
## 1 17451 ID apnic 2000-11-07 BIZNET-AS-AP BIZNET NETWORKS, ID
We can also retrieve BGP Peer ASN info for a list of IP addresses:
BulkPeer("182.253.162.193")
## Peer.AS IP BGP.Prefix CC Registry Allocated
## 1 6453 182.253.162.193 182.253.162.0/24 ID apnic 2010-05-10
## 2 6939 182.253.162.193 182.253.162.0/24 ID apnic 2010-05-10
## 3 58463 182.253.162.193 182.253.162.0/24 ID apnic 2010-05-10
## 4 196844 182.253.162.193 182.253.162.0/24 ID apnic 2010-05-10
## Peer.AS.Name
## 1 AS6453 - TATA COMMUNICATIONS (AMERICA) INC, US
## 2 HURRICANE - Hurricane Electric LLC, US
## 3 PGASCOM-AS-AP PT.PGAS TELEKOMUNIKASI NUSANTARA, ID
## 4 PIONIER-AS-AMS-IX PIONIER, National Research and Education Network in Poland, PL
Applying that function on zeus$IP, we obtain a data frame in return:
origin <- BulkOrigin(zeus$IP)
head(origin)
## AS IP BGP.Prefix CC Registry Allocated
## 1 37963 101.200.81.187 101.200.0.0/16 CN apnic 2011-01-24
## 2 132717 103.19.89.118 103.19.89.0/24 IN apnic 2013-03-28
## 3 132717 103.230.84.239 103.230.84.0/24 IN apnic 2014-04-29
## 4 46062 103.4.52.150 103.4.52.0/24 ID apnic 2011-05-19
## 5 131447 103.7.59.135 103.7.56.0/22 TH apnic 2012-03-15
## 6 46261 104.247.219.41 104.247.216.0/21 US arin 2014-12-09
## AS.Name
## 1 CNNIC-ALIBABA-CN-NET-AP Hangzhou Alibaba Advertising Co.,Ltd., CN
## 2 NDCTPL-IN NxtGen Datacenter & Cloud Technologies Pvt. Ltd., IN
## 3 NDCTPL-IN NxtGen Datacenter & Cloud Technologies Pvt. Ltd., IN
## 4 LINTASBUANA-AS-ID PT. BUANA LINTAS MEDIA, ID
## 5 POP-IDC-TH POPIDC powered by CSLoxinfo, TH
## 6 QUICKPACKET - QuickPacket, LLC, US
igraphsuppressPackageStartupMessages(library(igraph))
bookclub <- graph(edges=c("Samuel", "Mulia", "Mulia", "Jessica", "Jessica", "Samuel",
"Andita", "Jessica", "Jessica", "Tiara", "Samuel", "Tiara"),
directed = TRUE)
plot(bookclub)
title(main = "Who borrowed from who?", sub = "Source: Algoritma Book Club")
In “graph language”, vertex is a synonym for a node: the common point of two line segment. Edges (lines) are what connect two nodes in a network. In the following exercise, we’ll create a simple graph using the igraph package.
The graph (visualized):
For each IP address (origin$IP) retrieve the origin AS CC and return them as a pair to create the IP-> CC edge list
## [1] "101.200.81.187" "CN" "103.19.89.118" "IN"
## [5] "103.230.84.239" "IN"
Remember how we use + vertices to add vertices? Similarly, we can also use + edges to add edges:
g <- g + edges(unlist(ip_cc_edges))
ecount(g)
## [1] 113
vcount(g)
## [1] 226
Plotting the graph again:
plot(g, vertex.label.cex=0.4)
As an additional exercise: if you call tkplot(g) and then tinker with the various options in “Layout”, you’ll see that igraph has included an interactive graph drawing facility.
We’ll now try to simplify the graph using the simplify and delete function (both from igraph). simplify maps multiple edges to single edges. This function also provides a mechanism to specify what to do with the vertices / edge attributes in these cases: - vertex.attr.comb
- edge.attr.comb
The specification of the combination of (vertex or edge) attributes can be given as either a character scalar, a function object or a list of character scalars and/or function objects.
## [1] 226
## [1] 152
## IGRAPH 7e1b6b8 DN-- 152 113 --
## + attr: name (v/c), color (v/c), size (v/n), group (v/n)
## + edges from 7e1b6b8 (vertex names):
## [1] 101.200.81.187 ->CN 103.19.89.118 ->IN 103.230.84.239 ->IN
## [4] 103.4.52.150 ->ID 103.7.59.135 ->TH 104.247.219.41 ->US
## [7] 109.127.8.242 ->AZ 109.229.210.250->LV 109.229.36.65 ->RU
## [10] 113.29.230.24 ->SG 120.31.134.133 ->CN 120.63.157.195 ->IN
## [13] 124.110.195.160->JP 128.210.157.251->US 139.59.36.231 ->SG
## [16] 141.8.226.58 ->CH 151.97.190.239 ->IT 162.223.94.56 ->US
## [19] 176.107.179.60 ->UA 176.107.191.119->UA 177.4.23.159 ->BR
## [22] 180.182.234.200->KR 185.35.138.22 ->NL 185.6.242.251 ->IT
## + ... omitted several edges
Notice that the number of vertices has reduced from 226 to 152. We simplified the graph by comibining common edges and delete any standalone vertices (ASNs); In graph terms, this means any vertex with a degree of 0.
The plot looks like this:
By setting arrow.size to be 0 and remove the IP addresses (name) such that it only show the ASNs, we can have a cleaner graph.
E(g)$arrow.size <- 0
V(g)[grep("\\.", V(g)$name)]$name <- ""
plot(g, vertex.label.cex=0.4)
We can use the layout_on_* or layout_as_* layout options to achieve the desired graph visualization:
# L <- layout_on_sphere(g)
# L <- layout_on_grid(g)
L <- layout_as_star(g)
plot(g, layout=L,
vertex.label.dist=2,
vertex.label.cex=0.6)
We can also use layout_nicely, which choose an appropriate graph layout algorithm automatically:
set.seed(0)
L <- layout_nicely(g)
plot(g, margin=0, layout=L,
vertex.label.dist=0.7,
vertex.label.cex=0.7,
vertex.label.color="black",
vertex.label.family="sans",
vertex.label.font=2,
main="ZeuS botnet nodes clustered by Country")
dplyr system## Warning: 'plyr' namespace cannot be unloaded:
## namespace 'plyr' is imported by 'ggplot2', 'netintel' so cannot be unloaded
lattice and ggplot2From the elevation dataset, we multiply the x and y by 50 (help documentation say the coordinates are in multiples of 50 feet) and z by 10 (help documentation say the elevations are in multiples of 10). Rather than plotting the individual values what we would do instead is to create a surface that covers the whole grid region using the loess function. This function fits a local polynomial trend surface using weighted least squares to approximate the elevation across the whole region. The function for a local quadratic surface is:
We will then create a grid region that goes from 10 to 300 feet in both the x and y directions:
Use our loess model to predict and fill the grid. Notice that z is a matrix with two dimensions (x and y respectively):
Now use image() to see how our loess model fill the grid with its fitted value. Adjust the span to be 0.7 or higher and then re-run the exercise to see how our fitted values are smoother:
As a comparison, take a look at the fitted values when we use a linear model instead of our loess model:
Instead of creating a colored grid using image(), we can use levelplot from lattice:
Data on soya bean production in a uniformity trial measured at plots of size 5x5 meters and other soil properties measured in points given by the data coordinates. X and Y variable refers to the respective coordinates, PH is the soil PH value, K is the potassium content and PROD is production converted to productivity by a unit of area - hectare (ton/ha):
## X Y P PH K MO SB iCone Rend PROD
## 1 5.6 3.6 4.4 6.3 0.53 47.52 80.63 31.0 6.3980 2.56
## 2 15.2 1.6 3.7 6.3 0.38 48.74 82.23 29.5 7.0502 2.82
## 3 24.8 4.6 3.5 5.4 0.47 51.79 61.47 23.0 5.8848 2.35
## 4 34.4 5.6 3.9 5.2 0.48 57.88 59.09 24.5 8.5182 3.41
## 5 44.0 1.6 3.6 5.6 0.52 59.70 71.61 18.5 7.4749 2.99
## 6 53.6 1.6 3.7 5.2 0.45 53.61 56.98 21.5 9.3160 3.73
Anscombe (1973) has a nice example where he uses a constructed dataset to emphasize the importance of using graphs in statistical analysis. The data is built into R:
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
## Warning: 'plyr' namespace cannot be unloaded:
## namespace 'plyr' is imported by 'ggplot2', 'netintel' so cannot be unloaded
## # A tibble: 4 x 6
## set `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 9 3.32 7.50 2.03 0.816
## 2 2 9 3.32 7.50 2.03 0.816
## 3 3 9 3.32 7.5 2.03 0.816
## 4 4 9 3.32 7.50 2.03 0.817
Notice that all the four sets have the same statistical properties and even the same regression coefficients; A visual exploration however surfaced 4 very different pattern within each dataset.
## [1] "greenyellow" "lightgoldenrodyellow" "lightyellow"
## [4] "lightyellow1" "lightyellow2" "lightyellow3"
## [7] "lightyellow4" "yellow" "yellow1"
## [10] "yellow2" "yellow3" "yellow4"
## [13] "yellowgreen"
Primer on Left Shift and Right shift
strtoi(00000111, base = 2)
## [1] 7
strtoi(00001110, base = 2)
## [1] 14
bitShiftL(7,1)
## [1] 14
strtoi(01000011, base = 2)
## [1] 67
# shift the bits to left by 1
strtoi(10000110, base = 2)
## [1] 134
bitShiftL(67, 1)
## [1] 134
# # shift the bits to left by 2
strtoi(100001100, base = 2)
## [1] 268
bitShiftL(67, 2)
## [1] 268
toDotNotation(268)
## [1] 0 0 0 0 1 1 0 0
toDotNotation(12)
## [1] 0 0 0 0 1 1 0 0
toDotNotation(255)
## [1] 1 1 1 1 1 1 1 1
BGP in Indonesia: https://dnslytics.com/bgp/id